ROCm and HIP: A Detailed 10-Chapter Guide: The Memory-Centric Nature of GPU Performance

In GPU acceleration, we have to abandon the "compute first" mindset. Modern performance is determined by Memory Management: the orchestration of data allocation, synchronization, and optimization between the host (CPU) and the device (GPU).

1. The Gap Between Memory and Compute

While GPU compute throughput ($TFLOPS$) has soared, memory bandwidth ($GB/s$) has grown at a much slower rate. This gap leaves the execution units frequently "starved", waiting for data to arrive from VRAM. As a result, GPU programming is often memory programming.
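A rough back-of-the-envelope sketch can make the starvation concrete. The peak compute and bandwidth figures below are hypothetical values chosen for illustration, not the specification of any particular GPU, and `vector_add_times` is a made-up helper name. It estimates how long an element-wise float32 vector add would need for arithmetic alone versus for memory traffic alone.

```python
# Hypothetical hardware figures, chosen only for illustration.
PEAK_FLOPS = 20e12       # assumed: 20 TFLOP/s of FP32 compute
MEM_BANDWIDTH = 1e12     # assumed: 1000 GB/s of VRAM bandwidth


def vector_add_times(n):
    """Return (compute_seconds, memory_seconds) for c[i] = a[i] + b[i] on float32."""
    flops = n                  # one addition per element
    bytes_moved = 3 * 4 * n    # read a, read b, write c; 4 bytes each
    return flops / PEAK_FLOPS, bytes_moved / MEM_BANDWIDTH


compute_s, memory_s = vector_add_times(100_000_000)
print(f"compute: {compute_s * 1e6:.1f} us, memory: {memory_s * 1e6:.1f} us")
print(f"memory time / compute time: {memory_s / compute_s:.0f}x")
```

Under these assumed numbers the memory traffic takes roughly 240 times longer than the arithmetic, regardless of the vector length, so the execution units would sit idle for most of the kernel.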

2. The Roofline Model

This model illustrates the relationship between Arithmetic Intensity (FLOPs/Byte) and attainable performance. Applications generally fall into two categories:

Memory-bound: Limited by memory bandwidth (the sloped part of the roof).
Compute-bound: Limited by peak TFLOPS (the flat part of the roof).
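The two categories above can be sketched as a small calculation. In the standard Roofline formulation, attainable performance is the minimum of peak compute and arithmetic intensity multiplied by bandwidth. The hardware figures and the function names `attainable_flops` and `classify` are assumptions made for this example.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: min(peak compute, arithmetic intensity * memory bandwidth)."""
    return min(peak_flops, intensity * bandwidth)


def classify(intensity, peak_flops, bandwidth):
    """Label a kernel by which side of the ridge point its intensity falls on."""
    ridge = peak_flops / bandwidth   # FLOPs/Byte where the slope meets the flat roof
    return "memory-bound" if intensity < ridge else "compute-bound"


PEAK = 20e12   # assumed: 20 TFLOP/s
BW = 1e12      # assumed: 1000 GB/s

# SAXPY (y = a*x + y) on float32: 2 FLOPs per element,
# 12 bytes per element (read x, read y, write y).
saxpy_intensity = 2 / 12

for name, ai in [("saxpy", saxpy_intensity), ("high-intensity kernel", 100.0)]:
    bound = attainable_flops(ai, PEAK, BW)
    print(f"{name}: {ai:.3f} FLOPs/Byte, {classify(ai, PEAK, BW)}, "
          f"at most {bound / 1e12:.2f} TFLOP/s")
```

With these assumed figures the ridge point sits at 20 FLOPs/Byte, so SAXPY lands on the sloped part of the roof and can reach only a small fraction of peak compute no matter how well its arithmetic is tuned.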

3. The Cost of Data Movement

The main performance bottleneck rarely lies in the arithmetic itself; it lies in the latency and energy cost of moving a byte across the PCIe bus or out of HBM. High-performance code prioritizes keeping data resident where it is used and minimizes transfers between host and device.
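To illustrate why keeping data resident matters, the sketch below compares the host-device transfer time of a multi-step pipeline in two arrangements: copying the buffer to the device and back around every step, versus copying it in once and out once. The link bandwidth is an assumed round number, the model ignores per-transfer latency, and `pipeline_transfer_seconds` is a hypothetical helper, not part of any ROCm or HIP API.

```python
LINK_BANDWIDTH = 32e9   # assumed: 32 GB/s host-device link (illustrative only)


def pipeline_transfer_seconds(num_bytes, steps, resident):
    """Estimate total host-device copy time for a pipeline of `steps` kernels.

    resident=False: copy in and out around every kernel (2 * steps copies).
    resident=True:  copy in once, run all kernels, copy out once (2 copies).
    """
    copies = 2 if resident else 2 * steps
    return copies * num_bytes / LINK_BANDWIDTH


buffer_bytes = 1 << 30   # 1 GiB working set
steps = 10

naive = pipeline_transfer_seconds(buffer_bytes, steps, resident=False)
kept = pipeline_transfer_seconds(buffer_bytes, steps, resident=True)
print(f"round trip per step: {naive:.3f} s of transfers")
print(f"data kept on device: {kept:.3f} s of transfers")
```

In this simplified model the redundant round trips multiply transfer time by the number of steps, which is the "performance tax" that resident data avoids.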


QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.